A CUDA Kernel Scheduler Exploiting Static Data Dependencies

Author

  • Eva Burrows
Abstract

The CUDA execution model of Nvidia’s GPUs is based on the asynchronous execution of thread blocks, where each thread executes the same kernel in a data-parallel fashion. When threads in different thread blocks need to synchronise and communicate, the whole computation launched onto the GPU needs to be stopped and re-invoked in order to facilitate inter-block synchronisation and communication. The need for synchronisation is tightly connected with the underlying data dependency pattern of the computation. For a good range of algorithms, the underlying data dependency pattern is static, scalable and shows some regularity. For instance, sorting networks, the Fast Fourier Transform and stencil computations of PDE solvers are such examples, but parallel design patterns like scan, reduce and the like can also be considered. In such cases, much of the effort of devising and scheduling CUDA kernels for the computation can be automated by exposing the dataflow representation of the computation in the program code using a ...
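
The scheduling mechanism the abstract describes can be illustrated with a small, self-contained example. The sketch below is not the paper's scheduler; it is a hand-written host-side loop for one of the static dependency patterns the abstract lists (a binary reduction tree), in which each kernel launch computes one dependency level, and the in-order execution of launches on the default stream provides the inter-block synchronisation. All names (reduceLevel and so on) are illustrative, not taken from the paper.

// Minimal sketch: host-side scheduling of a computation with a static,
// level-structured data dependency pattern (a binary reduction tree).
// Threads in different blocks never communicate within a launch; the
// dependency between levels is realised by launching one kernel per
// level, since launches on the same stream execute in order.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One dependency level: element i of the next level is the sum of
// elements i and i + stride of the previous level.
__global__ void reduceLevel(float *data, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < stride && i + stride < n)
        data[i] += data[i + stride];
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);  // all ones, so the sum should be n
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The static dependency pattern fully determines this schedule:
    // log2(n) launches, one per level of the reduction tree.
    for (int stride = n / 2; stride > 0; stride /= 2) {
        int threads = 256;
        int blocks  = (stride + threads - 1) / threads;
        reduceLevel<<<blocks, threads>>>(d_data, stride, n);
    }

    float sum;
    cudaMemcpy(&sum, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", sum, n);
    cudaFree(d_data);
    return 0;
}

As far as the abstract goes, the point of the paper is that such launch schedules need not be written by hand: once the dependency pattern is exposed in the program code, they can be derived automatically.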


Related articles

Speckle Reduction in Synthetic Aperture Radar Images in Wavelet Domain Exploiting Intra-scale and Inter-scale Dependencies

Synthetic Aperture Radar (SAR) images are inherently affected by a multiplicative, noise-like phenomenon called speckle, which is characteristic of all coherent imaging systems. Speckle degrades the performance of almost all information extraction methods, such as classification, segmentation, and change detection, and must therefore be suppressed. Despeckling can be applied by the multilooki...


A Multi-Stage CUDA Kernel for Floyd-Warshall

We present a new implementation of the Floyd-Warshall All-Pairs Shortest Paths algorithm on CUDA. Our algorithm runs approximately 5 times faster than the previously best reported algorithm. In order to achieve this speedup, we applied a new technique to reduce usage of on-chip shared memory and allow the CUDA scheduler to more effectively hide instruction latency.
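
For context, the dependency structure that ties Floyd-Warshall to repeated kernel launches can be shown with the standard single-stage formulation. The sketch below is a baseline, not the multi-stage kernel of this paper: iteration k over the whole matrix depends on iteration k - 1, so each k becomes a separate launch. It assumes non-negative edge weights, a zero diagonal, and a row-major n x n distance matrix; all names are illustrative.

// Baseline Floyd-Warshall on CUDA: one kernel launch per intermediate
// vertex k, because step k reads the results of step k - 1 and only
// the end of a launch acts as a device-wide barrier.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void fwStep(float *dist, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        // Within step k, row k and column k never change (their update
        // adds dist[k][k] == 0), so reading them concurrently with
        // writes to dist[i][j] is safe under these assumptions.
        float viaK = dist[i * n + k] + dist[k * n + j];
        if (viaK < dist[i * n + j])
            dist[i * n + j] = viaK;
    }
}

// Host-side driver: n sequential launches on the default stream.
void floydWarshall(float *d_dist, int n)
{
    dim3 threads(16, 16);
    dim3 blocks((n + 15) / 16, (n + 15) / 16);
    for (int k = 0; k < n; ++k)
        fwStep<<<blocks, threads>>>(d_dist, n, k);
    cudaDeviceSynchronize();
}

int main()
{
    const int n = 512;
    std::vector<float> h(n * n, 1.0f);               // unit-weight complete graph
    for (int v = 0; v < n; ++v) h[v * n + v] = 0.0f; // zero diagonal

    float *d_dist;
    cudaMalloc(&d_dist, n * n * sizeof(float));
    cudaMemcpy(d_dist, h.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    floydWarshall(d_dist, n);
    cudaMemcpy(h.data(), d_dist, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("dist[0][n-1] = %.0f\n", h[n - 1]);       // 1 in a complete graph
    cudaFree(d_dist);
    return 0;
}

The abstract's reported 5x speedup comes from a multi-stage restructuring of this computation that reduces on-chip shared-memory usage; the full text with those details is not included here.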


Avoiding Blocking System Calls in a User-Level Threads Scheduler for Shared Memory Multiprocessors

SMP machines are frequently used to perform highly parallel computations. Multithreading has proved well suited to exploiting SMP architectures. In general, application developers use a thread library to write such programs. This library either schedules threads itself or relies on the operating system kernel to do so. However, both of these approaches pose a number of problems. This ...


Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized n...


Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming languages (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical k...



Journal:

Volume   Issue

Pages  -

Publication date: 2015